SAG: Sequence Alignment Algorithm based on Graph with distributed system Trinity

Jun-Su Lee; Yun-Ku Yeu; Hong-Chan Roh; Young-Mi Yoon; Sang-Hyun Park

데이터베이스 연구회지 (Database Research Society Journal, SIGDB)

Korean Title	그래프 기반 분산처리 시스템 트리니티를 이용한 서열 정렬 알고리즘 (A Sequence Alignment Algorithm Using the Graph-Based Distributed Processing System Trinity)
English Title	SAG: Sequence Alignment Algorithm based on Graph with distributed system Trinity
Author	Jun-Su Lee, Yun-Ku Yeu, Hong-Chan Roh, Young-Mi Yoon, Sang-Hyun Park
Citation	Vol. 30, No. 1, pp. 17-28 (April 2014)
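Korean Abstract (English translation)	In genomics, sequence alignment is among the most widely used techniques. With advances in next generation sequencing (NGS) technology, the volume of sequence read data has recently grown rapidly. Many sequence alignment algorithms have been developed to process this surge of NGS data. However, these algorithms require heavy computation to handle repeats and polymorphisms. As a result, existing sequence alignment algorithms face a trade-off between throughput and alignment quality. In contrast, alignment algorithms that run on distributed processing systems such as Hadoop and Trinity can achieve higher throughput while sacrificing less alignment quality than existing single-machine algorithms. In this paper, we propose SAG (Sequence Alignment Algorithm based on Graph with Trinity), a sequence alignment algorithm that runs on Trinity, the graph-based in-memory distributed system proposed by Microsoft. We transformed the reference sequence into graph-form data and then added new edges between adjacent nodes in the graph that can be connected. To perform alignment that tolerates polymorphism, we obtained candidates by combining sequence fragments. Finally, we performed glocal alignment on the candidates to find the final results. In experiments, SAG achieved considerably higher throughput with comparable or better alignment quality than existing algorithms running on Hadoop. We also demonstrated scalability: adding machines yields higher throughput.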
English Abstract	Sequence alignment is one of the most widely used tools in genomics. Since the development of NGS (Next Generation Sequencing) technology, the production of sequence read data has increased dramatically. A number of sequence alignment algorithms have been developed to process these NGS data. However, these algorithms suffer from a trade-off between throughput and alignment quality, because handling repeat reads and polymorphism carries a large computation cost. In contrast, alignment algorithms that run on distributed systems such as Hadoop and Trinity can obtain higher throughput with less compromise in alignment quality than existing single-machine algorithms. In this paper, we propose SAG, a graph-based sequence alignment algorithm that runs on Trinity, the in-memory distributed system proposed by Microsoft. We transformed the reference sequence into a graph and added new edges between adjacent nodes that can be connected. We then combined sequence fragments to obtain candidates that allow polymorphism. Finally, we performed glocal alignment on the obtained candidates to find the final results. Our experimental results show that SAG achieves considerably higher throughput with the same or better alignment quality than existing algorithms on Hadoop. We also demonstrated scalability: throughput improves simply by adding machines.
Keyword	sequence alignment algorithm, graph, distributed system, NGS (next generation sequencing)
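
The abstract names glocal alignment as the final step applied to each candidate but gives no scoring details. The following is a minimal single-machine sketch of one common reading of "glocal" (the read is aligned end to end, while the reference may be skipped freely at both ends). The function name `glocal_align` and the match, mismatch, and gap scores are illustrative assumptions, not values from the paper, and the sketch does not use Trinity.

```python
def glocal_align(read, ref, match=1, mismatch=-1, gap=-2):
    """Align the whole read against the best-scoring stretch of ref.

    Returns (score, end), where end is the exclusive index in ref at which
    the best alignment finishes. Scoring values are assumed for illustration.
    """
    m = len(ref)
    # Row 0: starting anywhere in the reference costs nothing.
    prev = [0] * (m + 1)
    for i in range(1, len(read) + 1):
        # Column 0: read characters placed before the reference start are gaps.
        cur = [prev[0] + gap] + [0] * m
        for j in range(1, m + 1):
            step = match if read[i - 1] == ref[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + step,  # match or mismatch
                         prev[j] + gap,       # read character against a gap
                         cur[j - 1] + gap)    # reference character against a gap
        prev = cur
    # Last row: ending anywhere in the reference costs nothing.
    best_end = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_end], best_end


print(glocal_align("ACGT", "TTACGTTT"))  # (4, 6)
```

In a pipeline like the one the abstract outlines, a routine of this kind would presumably be run once per candidate region and the highest-scoring candidate kept; how SAG actually distributes that work across Trinity nodes is not described in this record.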